Draft ProForma implementation #37

mobiusklein · 2021-03-29T02:53:18Z

This is a draft implementation for reading and writing the ProForma notation for modified amino acid sequences. I tried to avoid adding dependencies here: the additional controlled vocabularies are optional, and they are only loaded, lazily from psims, when you parse a string that uses them.

It still needs more documentation (especially about how to interact with some of its implementation details and which feature annexes it supports) and tests. I can likely inherit several of those from https://github.com/topdownproteomics/sdk/blob/master/tests/TopDownProteomics.Tests/ProForma/ProFormaParserTests.cs.

The ProForma specification is going through review now, but there's already discussion of an update to allow multiple modifications at a single position.

mobiusklein · 2021-05-24T03:31:07Z

@levitsky This should finally be ready for review, with the fundamental functionality all in place.

This adds the parse_proforma function, which parses a string in ProForma 2.0 format into a list of peptide position tokens and a dictionary of additional modification information (unlocalized, ambiguous or labile modifications, global modification rules, and so on), and a to_proforma function, which takes that information and turns it back into a ProForma 2.0 string. It also includes a ProForma class which layers on a little more behavior like mass calculation, slicing, and searching for tags by ID.
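To make the "list of peptide position tokens" idea concrete, here is a toy sketch of the round trip. It is not the code in this PR: toy_parse and toy_format are hypothetical names, and the sketch only understands single-letter residues followed by optional [tag] groups, with none of the terminal, labile, ambiguous, or global-rule syntax that the real parser handles.

```python
import re

# Toy illustration only: one uppercase residue followed by zero or more
# non-nested [tag] groups. Not the parser from this pull request.
_TOKEN = re.compile(r"([A-Z])((?:\[[^\[\]]*\])*)")
_TAG = re.compile(r"\[([^\[\]]*)\]")


def toy_parse(sequence):
    """Return a list of (amino_acid, tags_or_None) pairs."""
    tokens = []
    pos = 0
    while pos < len(sequence):
        match = _TOKEN.match(sequence, pos)
        if match is None:
            raise ValueError("Unexpected character at index %d" % pos)
        residue, raw_tags = match.groups()
        tags = _TAG.findall(raw_tags)
        # Unmodified positions carry None instead of an empty list.
        tokens.append((residue, tags or None))
        pos = match.end()
    return tokens


def toy_format(tokens):
    """Turn the token list back into the same bracketed string form."""
    parts = []
    for residue, tags in tokens:
        parts.append(residue)
        for tag in tags or ():
            parts.append("[%s]" % tag)
    return "".join(parts)


tokens = toy_parse("EM[Oxidation]EVEES[Phospho]PEK")
round_trip = toy_format(tokens)
```

The point of the sketch is only the shape of the data: one entry per sequence position, with the tags attached to the position they modify, so that writing the string back out is a simple traversal.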

The non-user-facing bits include all the baroque machinery for dealing with six different modification vocabularies, a more forgiving tokenizer, and a slightly borrowed test suite.

There is still some documentation to iron out, especially which "implementation level" this counts as, since it implements everything but inter-peptide cross-linking support. There is also the question of how to make users aware that they can control how additional controlled vocabularies are loaded. Right now it uses Unimod directly from pyteomics.mass.Unimod, but tries to import psims to load the rest, emitting an error message if it needs one of those databases and psims isn't installed.

levitsky

This is huge, thank you!
I left some questions/comments in the code. I don't have any use cases, but I was able to catch a couple of issues by trying to parse examples from the ProForma README. We can try copying those into tests and adding a psims install to the GA workflow.
Otherwise, my only real concern is formula parsing. I left a comment about it in the code, too.
Thank you once again for the awesome work.

levitsky · 2021-05-31T14:21:30Z

pyteomics/proforma.py

+from pyteomics.auxiliary import PyteomicsError, BasicComposition
+from pyteomics.auxiliary.utils import add_metaclass
+
+# To eventually be implemented with pyteomics port?


Is there anything you don't like about this dependency?

levitsky · 2021-05-31T14:28:33Z

pyteomics/proforma.py

+    load_psimod = partial(_needs_psims, 'PSIMOD')
+    load_xlmod = partial(_needs_psims, 'XLMOD')
+    load_gno = partial(_needs_psims, 'GNO')
+    obo_cache = None


This name does not seem to be used, is it necessary?

The name gets baked into the partial-made function so that the error is clear about which "entity" depended upon the other source, e.g.

>>> load_psimod()
ImportError: Loading PSIMOD requires the `psims` library. To access it, please install `psims`

Technically, we could just make the message "Loading this controlled vocabulary requires psims." and be done, but it feels less explicit.
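For readers following along, the pattern being described can be sketched roughly as below. This is an assumption-laden reconstruction from the diff context and the error message quoted above, not the exact code in the PR: it shows only the fallback loaders, which presumably are defined in the branch where importing psims fails.

```python
from functools import partial


def _needs_psims(name):
    # The vocabulary name is bound per loader via partial, so the error
    # names exactly which vocabulary needed the missing dependency.
    raise ImportError(
        "Loading %s requires the `psims` library. "
        "To access it, please install `psims`" % name)


# Fallback loaders: each one fails with a message naming its vocabulary.
load_psimod = partial(_needs_psims, 'PSIMOD')
load_xlmod = partial(_needs_psims, 'XLMOD')
load_gno = partial(_needs_psims, 'GNO')
```

Because the failure is deferred until a loader is actually called, importing the module stays cheap and dependency-free for users who only need Unimod.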

To be clear, this comment of mine referred to obo_cache only. It is not used in pyteomics code. Github displays four lines for context when the last one is the one I put the comment on.

Thank you for clarifying. Yes, that variable is imported to allow proforma to expose control over the default file cache from psims. That way users could set the cache directory or disable the cache altogether if they wished. Otherwise they'd need to explicitly import it from psims. It probably isn't necessary to import it here; it just needs to be documented clearly that proforma interacts with the psims cache mechanism.

Thank you! Indeed, I think it's worth mentioning in the docs because the user may not even know that psims is used, or how it works with caches.

mobiusklein · 2021-05-31T20:36:41Z

levitsky

Thank you for solving the isotope parsing issue and laying out the supported features, this is very helpful. (Also for all the smaller edits.)
I noticed another couple of minor issues with the help of pyflakes.
Other than that, one question/suggestion: would you agree to naming the entry-level function just parse, for the sake of brevity and consistency (see parser.parse and all the read functions)?

levitsky · 2021-06-01T14:06:03Z

pyteomics/proforma.py

+
+    def __getitem__(self, i):
+        if isinstance(i, slice):
+            props = self.properties.copy()


Suggested change

props = self.properties.copy()
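As a rough illustration of why the slice branch copies the properties, here is a minimal sketch with a hypothetical ToySequence class. It is not the ProForma class from this PR; it only assumes that the object holds a token list plus a dictionary of sequence-level properties, as described earlier in the thread.

```python
class ToySequence(object):
    """Hypothetical stand-in: a token list plus a dict of properties."""

    def __init__(self, tokens, properties=None):
        self.tokens = list(tokens)
        # Stored as-is on purpose, so sharing is visible without a copy.
        self.properties = properties if properties is not None else {}

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, i):
        if isinstance(i, slice):
            # Shallow copy: the slice gets its own dict, so assigning a
            # key on the slice does not change the parent's properties.
            props = self.properties.copy()
            return self.__class__(self.tokens[i], props)
        return self.tokens[i]


seq = ToySequence([("P", None), ("E", None), ("P", None)], {"n_term": None})
sub = seq[1:3]
sub.properties["n_term"] = "Acetyl"  # the parent sequence is unaffected
```

Note that the copy is shallow, so nested lists or dictionaries stored in the properties would still be shared between the parent and the slice.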

pyteomics/proforma.py

levitsky · 2021-06-01T14:08:32Z

Thank you! Indeed, I think it's worth mentioning in the docs because the user may not even know that psims is used, or how it works with caches.

levitsky · 2021-06-15T10:28:43Z

At this point I'm more than happy with the state of this PR. Please let me know if/when you think it's ready to merge.

mobiusklein · 2021-06-17T00:26:26Z

Thank you.

There's another ProForma meeting tomorrow which may or may not introduce more changes. The ambiguous sequence region feature was a late addition. We'll see if more work is needed or if there are any comments from the group.

mobiusklein · 2021-06-28T02:37:43Z

I've updated the documentation on psims to discuss the caching mechanism in a bit more detail. No new features have been added to ProForma since the last meeting, and likely the best way to get more feedback at this point is for people to try to use it. If you're satisfied with the level of documentation within the module itself, we can merge it.

mobiusklein added 15 commits December 12, 2020 15:45

experimenting

c98feca

Merge branch 'master' of https://github.com/levitsky/pyteomics into feature/proforma

7213b5e

A draft parser for ProForma without any semantics on the returned object

c79c3e8

Updates to the notebook

8c5301e

Add cv resolver

2dbe1c8

Clean up markers

0b79ccc

More proforma parsing experiments

d069380

Merge branch 'master' of https://github.com/mobiusklein/pyteomics into feature/proforma

5d7ec19

Fix out-of-order monosaccharide formulae; Add support for the Obs tag; Flesh out the Generic modification resolver;

ed2cacf

Merge branch 'master' of https://github.com/levitsky/pyteomics into feature/proforma

df04d0f

Merge branch 'master' of https://github.com/levitsky/pyteomics into feature/proforma

66cdcf9

Add multimod example

527b820

Prepping for draft PR

4afde59

Merge branch 'master' of https://github.com/levitsky/pyteomics into feature/proforma

9039b28

Add support for multiple tags per position, add tests, and fix some lingering issues.

34d36db

mobiusklein marked this pull request as ready for review May 24, 2021 03:13

mobiusklein added 3 commits May 23, 2021 23:14

No f-strings

35b4658

Use explicit super

feda7d0

Add unknown amino acid

8be8fc5

mobiusklein added 2 commits May 27, 2021 11:53

Fix terminal masses

5508775

update test

7caff0a

levitsky reviewed May 31, 2021

mobiusklein and others added 5 commits May 31, 2021 12:14

Fully support all the required additional amino acids

2b9402b

Remove duplicated undehydrated selenocysteine mass

5f5166e

Properly handle nested braces and isotopes

5937299

Update pyteomics/proforma.py

53c330a

Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>

Update pyteomics/proforma.py

e77ca1a

Co-authored-by: Lev Levitsky <lev.levitsky@phystech.edu>

Add compliance level to documentation

25fde39

levitsky reviewed Jun 1, 2021

mobiusklein added 7 commits June 2, 2021 20:54

Fix up glycan mass calculation

da164db

Fix slice behavior

8031207

Simplify, more documentation

293b050

Merge branch 'master' of https://github.com/levitsky/pyteomics into feature/proforma

29251d8

Add ambiguous sequence regions

325088d

ProForma testing requires psims

c544765

ci

43fcebe

levitsky merged commit 4cee0bb into levitsky:master Jun 28, 2021
